The DiaCORIS project: a diachronic corpus of written Italian
Abstract
The DiaCORIS project aims to build a diachronic corpus of written Italian texts produced between 1861 and 1945, extending the structure and research possibilities of the synchronic 100-million-word corpus CORIS/CODIS. A preliminary in-depth study was carried out to design a representative and well-balanced sample of the Italian language over a period that spans the main events of contemporary Italian history, from National Unification to the end of the Second World War. The paper describes in detail the design process: the definition of the main subcorpora and their proportions, the types of documents included in each part of the corpus, the document annotation schema, the technological infrastructure designed to manage corpus access, and the web interface to the corpus data.
Similar resources
Gearing the Discursive Practice to the Evolution of Discipline: Diachronic Corpus Analysis of Stance Markers in Research Articles’ Methodology Section
Despite widespread interest and research among applied linguists to explore metadiscourse use, very little is known of how metadiscourse resources have evolved over time in response to the historically developing practices of academic communities. Motivated by such an ambition, the current research drew on a corpus of 874315 words taken from three leading journals of applied linguistics in orde...
CORIS/CODIS: A corpus of written Italian based on a defined and a dynamic model
A corpus of written Italian – CORIS – has been under construction at the Centre for Theoretical and Applied Linguistics of Bologna University (CILTA) since 1998 and will soon be completed and made available on-line. The project aims at creating a representative and sizeable general reference corpus of contemporary Italian designed to be easily accessible and user-friendly. CORIS contains 80 mil...
Multiple Tokenizations in a Diachronic Corpus
This paper deals with the construction of a maximally flexible corpus architecture for building and analyzing diachronic corpora. Historical data poses many challenges with regard to representation and analysis, and diachronic corpora are even more varied and unsystematic (Claridge, 2008). Since historical and diachronic corpora are so difficult and expensive to build, it is crucial that they b...
Investigating Lexico-grammaticality in Academic Abstracts and Their Full Research Papers from a Diachronic Perspective
The development of science and academic knowledge has led to changes in academic language and in the transfer of information and knowledge. In this regard, the present study is an attempt to investigate lexico-grammaticality in academic abstracts and their full research papers in Linguistics, Chemistry and Electrical Engineering papers published between 1991 and 2015 in academic journals from a diachronic pers...
The VENEX corpus of anaphora and deixis in spoken and written Italian
The VENEX corpus is a corpus of Italian annotated with information about anaphora and deixis, created in a joint project between the Università di Venezia and the University of Essex. The corpus includes both texts (articles from a financial newspaper) and dialogues (an Italian version of the MapTask corpus). The annotation scheme is an almost complete implementation of the scheme proposed in M...